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Abstract 

Background: Rise of temperatures and shortening of available water as result of predicted climate change will 
impose significant pressure on long-lived forest tree species. Discovering allelic variation present in drought related 
genes of two Austrian oak species can be the key to understand mechanisms of natural selection and provide 
forestry with key tools to cope with future challenges. 

Results: In the present study we have used Roche 454 sequencing and developed a bioinformatic pipeline to 
process multiplexed tagged amplicons in order to identify single nucleotide polymorphisms and allelic sequences 
often candidate genes related to drought/osmotic stress from pedunculate oak {Quercus robur) and sessile oak (0. 
petroeo) individuals. Out of these, eight genes of 336 oak individuals growing in Austria have been detected with a 
total number of 158 polymorphic sites. Allele numbers ranged from ten to 52 with observed heterozygosity ranging 
from 0.1 15 to 0.640. All loci deviated from Hardy-Weinberg equilibrium and linkage disequilibrium was found 
among six combinations of loci. 

Conclusions: We have characterized 183 alleles of drought related genes from oak species and detected first 
evidences of natural selection. Beside the potential for marker development, we have created an expandable 
bioinformatic pipeline for the analysis of next generation sequencing data. 



Background 

White oaks are native to Europe, Asia, North Africa and 
North America, and include sessile oak, pedunculate 
oak, pubescent oak (Q. pubescens) and bur oak (Q. 
macrocarpa) among their most prominent species [1]. 
About two percent of the Austrian forests harbour oak 
trees which correspond to an area of 66,000 hectare. Q. 
petraea and Q. robur are the predominant species while 
Q. cerris (Turkey oak) and Q. pubescens play only a 
minor role [2]. In Austria as well as in Europe oak spe- 
cies colonize huge areas with vastly differing climatic 
conditions. 

Rise of temperatures and shortening of available water 
as result of predicted climate change will impose signifi- 
cant pressures on long-lived forest tree species like 
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European white oak. According to the latest predictions, 
we expect the average global surface temperature to in- 
crease by a maximum of 6.4°C within the next 90 years 
[3] leading to a higher frequency of severe drought 
events. But it is well known that different species diverge 
in their ability to resist drought induced damages and 
even within a species there is tremendous variability 
[4,5]. Inter- as well as intraspecific allelic diversity is the 
key element of a plants potential to adapt to a changing 
environment and tolerance towards drought stress. There 
is an increased demand for molecular tools helping to de- 
scribe these variations and to prepare forestry for future 
challenges. 

Molecular markers are the first choice for plant research 
and breeding [6]. Using them as landmarks, genetic maps 
can be established and subsequently used for identifi- 
cation of traits controlled by different genes (quantita- 
tive trait loci). Marker assisted selection provides 
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breeders with an efficient tool for identifying desired 
phenotypes in large populations. Beside their use in 
breeding, molecular markers are highly valuable for 
population genetics permitting evolutionary studies 
and population structure interference. 

Single nucleotide polymorphisms (SNPs) are com- 
monly used for functional diversity assessment. Although 
they are highly abundant in the human genome and 
every SNP locus could potentially serve as utile marker, 
there are still few studies dealing with a high number of 
SNPs in plants [7]. Latest advances in human and animal 
genome analysis have created several sophisticated tech- 
nologies which are capable to analyze millions of SNPs 
in reasonable time and with low costs [6] and can be eas- 
ily transferred to applications in plants. SNP discovery 
technologies include bioinformatic mining of expressed 
sequence tag (EST) databases [8], array based methods 
[9], comparison of whole genomes [10] and the applica- 
tion of next generation sequencing (NGS) for amplicon 
resequencing [11]. Recently developed sequencing tech- 
nologies summarized as next generation sequencing have 
already replaced traditional methods for detecting poly- 
morphisms in genomes [12]. 

Roche 454 sequencing technology [13] is based on sin- 
gle strand amplification with emulsion polymerase chain 
reaction followed by pyrosequencing. Average read- 
lengths of around 400 bp and high achievable coverage 
makes this technology well suited for discovery of SNPs 
and even detection of rare alleles. Short oligonucleotide 
barcodes can be used to tag individual sequences and en- 
able the parallel analysis of several targets which has 
been successfully demonstrated in different species [14- 
16]. Multiplexing capabilities enable large studies includ- 
ing several hundred individuals with a high coverage for 
each sample. Therefore extensive cloning procedures to 
identify both haplotypes of diploid individuals as used in 
Sanger sequencing or generation of inbred lines [17] can 
be avoided. Bioinformatic haplotype interference with 
parsimony [18] or maximum-likelihood methods [19] is 
no longer necessary because each haplotype will be cov- 
ered by a sufficient number of reads. The present paper 
describes the discovery and characterization of alleles in 
ten drought stress related genes originating from two 
oak species growing in Austria based on multiplex 454 
amplicon sequencing and the development of a bioinfor- 
matic analysis pipeline. 

Results 

Processing of 454 sequencing data 

Raw data with a total number of 253,630 reads comprising 
57.8 Mbp with an average length of 227.85 bp was deliv- 
ered by the sequencing company (Table 1). Average 454 
quality score of the provided sequences was 22.95 (median 
25.19) with a maximum of 37.93 and a minimum of 5.50. 



We removed 45,039 reads shorter than 90 bp (Figure 1) 
with an average length of 67.93 bp. A total number of 298 
sequences longer than 420 bp with an average length of 
447.21 bp and a maximum of 705 bp were trimmed. Aver- 
age 454 quality score after removal of long and short 
sequences increased to 27.78 (median 28.24). Barcode 
sequences were not readable in 5,108 reads which were 
therefore discarded. Additionally we identified 1,863 reads 
with corrupt internal primers which were excluded from 
further analysis (Figure 1). After preprocessing 201,620 
reads consisting of 53 Mbp sequence information with an 
average length of 230.35 bp remained for allele detection 
(Table 1). 

During homopolymer (HP) correction in SCARF [20] 
17,287 additional reads were removed because the soft- 
ware was not able to align them to the Sanger reference 
[21]. On average 18 individuals per locus were removed 
from the analysis (Table 2) because of technical errors 
including incorrect amplification, short reads, read qual- 
ity and corrupted barcodes or internal primers. A max- 
imum number of 28 lost individuals was found in BMY7 
while in PIP1E we only lost 7 individuals. 

Polymorphic sites and allele identification 

Within the sequences of GLP3A we detected more than 
2 alleles per individual. A minimum number of one allele 
and a maximum number of 12 alleles per individual with 
an average of 2.6 were found. GLP3A was excluded from 
further analysis because we suspected the amplification 
of a gene family or pseudogenes. Including individuals 
with less than four reads covering 70 % of the Sanger 
reference 65,181 reads were removed from the 201,620 
reads remaining after preprocessing. Due to low coverage 
an average number of 18 individuals had to be excluded 
from further analysis with a maximum number of 99 in 
DHN2 and a minimum number of two found in LEA 14 
and PER64 (Table 2). DHN2 was not included in the fur- 
ther analysis because we considered the number of indivi- 
duals lost due to low coverage (99) as too high. Excluding 
DHN2, the average percentage of recovered individuals 
was 92.3%. On average each gene fragment was covered by 
46.09 reads. In total we detected 158 polymorphic sites 
[SNPs and Insertions/Deletions (InDels)] among the eight 
remaining candidate genes (Table 3) with an average of 
19.75 sites per gene. The highest number was found in 
LEA14 and LTP (34), whereas a minimum number of 
seven polymorphic sites was detected in PER64. On aver- 
age 6.7 polymorphic sites occur each 100 bp with a mini- 
mum of 2.8/100 bp in ARE 16 and a maximum of 13.7/ 
100 bp in LTP. We found an average number of 22 alleles 
per gene with a maximum of 52 in BMY7 and a minimum 
of ten in ARF16, PER64 and RD26. The total number of 
detected alleles was 183. Sequences of alleles as well as 
genotypes of the analyzed individuals can be found in 
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Table 1 Comparison between raw data and preprocessed reads 



Pool 


Number of reads 


Received 
Base pairs (Mbp) 


Average length 


Number of reads 


Preprocessed 
Base pairs (Mbp) 


Average length 


Oak 1 


30,207 


5.72 


1 89.45 


22,826 


5.1 1 


200.96 


UdK z 


oo 7Q7 
DO,/ y / 


O.J J 




Zy^yU 


7 71 
/ ./ I 


Ajy.AD 


Oak 3 


31,731 


6.70 


211.29 


24,336 


6.08 


229.92 


Oak 4 


31,874 


7.63 


239.31 


26,063 


7.10 


248.58 


Oak 5 


33,645 


8.68 


258.07 


30,619 


8.38 


180.76 


Oak 6 


30,008 


6.95 


231.72 


23,650 


6.39 


246.25 


Oak 7 


29,307 


7.07 


241.34 


23,909 


6.61 


252.14 


Oak 8 


28,061 


6.49 


231.16 


20,927 


5.61 


244.92 


Total 


253,630 


57.80 


227.85 


201,620 


52.98 


230.35 



Additional file 1. Each allele of an individual was covered 
by an average of 32.64 reads (median 28, Figure 2). Total 
number of effective alleles was 23.561 with a minimum of 
1.253 (RD26), a maximum of 6.656 (BMY7) and a mean 
value of 2.945 (Table 3). Values for expected heterozygosity 
ranged from 0.202 (RD26) to 0.851 (BMY7) with an aver- 
age number of 0.545. Regarding observed heterozygosity, a 
minimum of 0.115 in RD26 and a maximum of 0.640 in 
LEA 14 was detected. The mean value of observed hetero- 
zygosity was found to be 0.347. In all genes expected het- 
erozygosity exceeded the observed. All loci showed 
significant departure from Hardy- Weinberg expectations 
(HWE, Guo and Thompsons exact test [P<0.05]). Null 
alleles were present in a high frequency (above 5%) in all 
loci observed. Tests for linkage disequilibrium (LD) 
revealed significant disequilibrium (P<0.01) within six 
pairwise combinations of loci (Additional file 2: Table S2). 



45,039 




Barcode corrupt 
| Primer corrupt 

Figure 1 Distribution of all reads after preprocessing. Amount 
of reads lost due to technical reasons including sequences shorter 
than 90 bp, reads with corrupt barcode or primer and remaining 
reads are displayed. 



Discussion 

Several studies describe the in silico detection of SNPs 
from the growing amount of EST data available [22,23]. 
However this approach is likely to be biased by the small 
number of individuals coming from a limited amount of 
populations [24]. To overcome this limitation we made 
use of high throughput sequencing techniques and a bio- 
informatic pipeline. 454 Sequencing technology developed 
by Roche has been chosen to analyze coding regions of 
drought responsive candidate genes. The main of advan- 
tage of this technique is the high number of reads gener- 
ated leading to adequate coverage of each allele. Major 
drawbacks include the preferential amplification of short 
sequences and the high abundance of reading errors after 
homopolymer stretches. To cope with these problems we 
have developed a highly customizable pipeline which inte- 
grates preprocessing, statistics and error correction for 454 
amplicon sequencing data and can be used to resolve mul- 
tiplexed samples and to detect variants in a large number 
of sequences. 

Different studies based on molecular markers have been 
conducted in oak. But the majority was investigating the 



Table 2 Recovery statistics and individual coverage 



Gene 


Technical 
loss 


Coverage 
loss 


Remaining 
Individuals Reads 


Average 
coverage 


ARF16 


27 


18 


291 (86.6%) 


9,327 


32.05 


BMY7 


28 


16 


292 (86.9%) 


13,189 


45.17 


DHN2 


22 


99 


n.a. 


n.a. 


n.a. 


ERD8 


8 


9 


319 (94.9%) 


11,941 


37.43 


LEAH 


25 


2 


309 (92.0%) 


16,102 


52.11 


LTP 


18 


5 


313 (93.2%) 


20,409 


65.20 


PER64 


11 


2 


323 (96.1%) 


22,628 


70.06 


PIP IE 


7 


4 


325 (96.7%) 


11,364 


34.97 


RD26 


15 


9 


312 (92.9%) 


9,898 


31.72 


Average 


18 


18 


310 (92.3%) 


14,357 


46.09 


Total 


161 


164 


2,484 


114,858 
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Table 3 Polymorphisms detected 



Gene 


s 


N 


N 


H 


H 


F 


HWE 


Mutation^/1 00 hn 

1 VI 1*1 LCI IIUI 13/ 1 \J\J KJ yj 


ARF16 


10 


10 


1.667 


0.249 


0.404 


0.134 


0.000 


2.75 


BMY7 


19 


52 


6.656 


0.481 


0.851 


0.189 


0.030 


5.97 


rnnn 

LHUo 


26 


27 


2.602 


0.551 


0.61 9 


0.1 1 2 


0.000 


7.24 


LEAH 


34 


49 


5.079 


0.640 


0.803 


0.095 


0.000 


12.23 


LTP 


34 


13 


2.849 


0.260 


0.648 


0.228 


0.000 


13.71 


PER64 


7 


10 


1.817 


0.209 


0.444 


0.182 


0.000 


3.37 


PIP IE 


15 


12 


1.638 


0.274 


0.391 


0.111 


0.000 


4.79 


RD26 


13 


10 


1.253 


0.115 


0.202 


0.115 


0.000 


3.61 


Average 


19.75 


22.88 


2.945 


0.347 


0.545 


0.146 


0.004 


6.71 


Total 


158 


183 


23.561 













S Number of polymorphic sites, N a number of alleles, N e effective number of alleles, H D observed heterozygosity, H e expected heterozygosity, F n frequency of null 
alleles, HWE test for deviation from Hardy-Weinberg equilibrium (p-value) 



population history of Quercus spp. [25] and hybridisation 
between Q. robur and Q. petraea [26]. Only Derory et al 
[27] and Quang et al [28] used SNPs to assess allelic di- 
versity of candidate genes. So far, this is the first study 
which aimed at detecting alleles of drought stress related 
genes of oak. With the aid of the bioinformatic pipeline 
presented above were able to detect a total amount of 
183 alleles within eight genes of 336 oak individuals 
growing in Austria. The amount of alleles is comparable 
to results from white spruce where 173 alleles were 
detected in six loci from 283 individuals growing in Al- 
aska [29]. The average number of 6.71 mutations per 
100 bp is higher or at least in the same range than in 
other plant species. Frequencies of 2.96 to 3.70, 3.85 or 
3.83 to 7.30 have been discovered in maize [30], black 
poplar [31] and eucalyptus [16] which shows that the 




> Q> <b N N \ <b N <b N A <6 N o>' N 0 N N \ N 1 K 

Coverage classes (fold-coverage) 



Figure 2 Allelic coverage distribution. Number of alleles found in 
different classes of fold coverage is displayed. 



number of mutations strongly varies between species 
and gene analyzed. A high level of heterozygosity is 
present within ERD8 and LEA 14 which points towards a 
large amount of genetic variability. Regarding the large 
number of individuals sampled across the heterogenic 
climatic area of Austria, a high variability is expected es- 
pecially in genes like ERD8 and LEA14 which are directly 
involved in processes regulating drought response. 

The high number of discovered alleles will be used as 
valuable source for association studies between allele fre- 
quencies and environmental variables in connection with 
drought stress. SNPs discovered in this study can be used 
to describe the genetic diversity present in two Austrian 
oak species and to develop molecular markers for drought 
tolerance if natural selection can be proven for some loci. 
Significant deviation from Hardy-Weinberg equilibrium 
as well as lower values of observed heterozygosity than 
expected at all loci may be the first evidence for natural 
selection. The deviations could partly be a result of sam- 
pling, be explained by the presence of null alleles (Table 3) 
or might arise from selective pressure on the coding 
regions. Although varying among genes, low levels of 
observed heterozygosity and an excess of homozygotes at 
RD26 support the latter hypothesis as well as the presence 
of linkage disequilibrium between some pairs of loci. This 
might either occur due to epistatic natural selection where 
favourable combinations of genes are linked and function 
together as supergenes [32] or several other factors in- 
cluding gene flow which is well documented among sev- 
eral oak species [27,33,34]. Present linkage between pairs 
of loci is the basis of association studies which can help to 
identify alleles occurring more frequently in plants 
exposed to dry conditions. Linkage between several loci of 
drought reactive genes is generally expected as drought 
response is under control of a huge regulatory network 
[35,36]. However, this approach is limited by the fact that 
rapid decay of LD is commonly observed in forest tree 
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species [37,38] increasing the marker density needed for 
successful association studies. 

Considering these findings, it points out clearly that 
the available dataset should be subject of further ana- 
lysis like decay of LD, estimation of genetic diversity or 
differentiation of populations. To evaluate if selective 
forces were acting, different neutrality test as well as de- 
tection of outlier loci will be necessary. To evaluate if 
these deviations were only generated by demographic 
processes the results should be related to neutral mar- 
kers like microsatellites found in chloroplasts (cpSSRs). 
If the presence of loci under selection and their associ- 
ation with climatic variables like temperature and pre- 
cipitation could be proven, development of molecular 
markers will be possible. Given the predicted climate 
change and the resulting pressure on sessile organisms, 
functional markers might be a valuable tool for forestry 
and a basis for marker assisted selection of drought tol- 
erant genotypes. Additional benefit arises from the fact 
that two different oak species were included in the ana- 
lysis. Although they are mainly sympatric, Q. petraea is 
not as susceptible to drought induced damages as Q. 
robur. If species specific alleles can be discovered and 
related to environmental conditions the genetic basis of 
this benefit may be revealed. 

Conclusions 

Given next generation sequencing data of ten drought 
stress related genes originating from different oak spe- 
cies growing in Austria, we were able to discover new 
SNPs and characterize 183 alleles. In order to obtain 
these results, we have developed a semiautomatic ana- 
lysis pipeline based on freely available tools and scripts. 
This pipeline can be fully automated and provided with 
a graphical user interface to make it more valuable for 
the scientific community. First analysis of the genetic 
data provides evidence of natural selection acting on the 
genes which makes them a target for future evolutionary 
studies and a potential source for molecular marker de- 
velopment. The alleles discovered will definitely help to 
understand drought adaptation processes acting in for- 
est tree species. 

Availability of supporting data 

The raw data supporting the results of this article are 
available in the Dryad repository, http://dx.doi.org/ 
10.5061/dryad.83gfll3b. The data sets supporting the 
results of this article are included within the article and 
its additional files. 

Methods 

Plant material and DNA isolation 

Genomic DNA from 336 oak individuals collected across 
Austria [39,40] was extracted with the DNeasy Plant Mini 



Kit (Qiagen) according to the manufacturers instructions. 
Microarray experiments with long-term drought stress 
applied to oak clones [21] provided the basis for selection 
of ten putative candidate genes for drought adaption 
(Table 4). 

Amplification of candidate genes 

Sequencing adaptors and barcodes were attached to the 
gene of interest following a modified two-step approach 
used by Schuelke [41]. In the first step, genespecific pri- 
mer pairs (internal primers) with M13-tails were used 
to amplify regions of interest. Internal primers targeting 
regions with a minimum of 200 and a maximum of 
375 bp were planned with Primer3 [42] using default 
settings. M13Fw (5 -TGTAAAACGACGGCCAGT-3 ') 
and M13Re (5 -CAGGAAACAGCTATGACC-3 ') were 
synthesized to the 5 '-end of the internal primers. Ampli- 
cons for 454 sequencing were generated from 20 ng of 
template DNA. PCR reactions were performed in 25 ul 
total volume using 5 ul 5 x HOT FIREPol® Blend Master 
Mix 12.5 mM MgCl 2 without dye (Solis BioDyne) and 3 
pmol of each primer. A three-step PCR program consist- 
ing of 15 min. initial denaturation at 95°C followed by 32 
cycles denaturation at 95°C for 30 sec, annealing at a 
temperature of 68°C for 45 sec, extension at 72°C for one 
minute and a final extension at 72°C for 8 minutes was 
used. 

Forward primer for the second amplification con- 
sisted of the 454 sequencing adaptor A (5-CCATCT- 
CATCCCTGCGTGTCTCCGACTCAG-3 ') linked to a 
hexanucleotide barcode sequence corresponding to 
the amplified individual and the M13Fw sequence. 
Reverse primers consisted of the 454 sequencing 
adaptor B (5 - CCTATCCCCTGTGTGCCTTGGCAG 
TCTCAG-3) and the M13Re sequence. HOT FIREPol® 
Blend Master Mix and a primer concentration according 
to Table 5 was used for the amplification including an ini- 
tial denaturation step of 95°C for 15 minutes. Subsequent 
cycling comprised denaturation at 95°C for 20 sec, 
annealing at 55°C for time given in Table 5 and extension 
at 72°C for 15 seconds. The number of cycles was depend- 
ing on the gene region amplified and can be found in 
Table 5. Product size and concentration was verified with 
gel electrophoresis on 1.0% agarose gels. 

Multiplexed 454 amplicon sequencing 

PCR products of forty-two individuals per gene were 
pooled resulting in a total number of 80 pools. Cleaning 
was performed after pooling with the QIAquick PCR 
purification kit (QIAGEN) according to the manufac- 
turers instructions. Concentrations of these pools were 
measured on Nanodrop (Thermo Scientific) and adjusted 
to 30 ng/ul adding TE buffer pH 8.0. PCR products of 
the same individuals originating from different genes 
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Table 4 Summary data of candidate genes 



Gene Annotation Accession Primer pairs T a (°C) Length 



ARF16 


Auxin responsive factor 16 


FP033974 


Fw: 5 '-M 1 3 F w-C AGTTG ATGTTGGTTGG AC A-3 ' 
Re: 5-M 1 3Re-ACAATAAACAAATGCTACTCA-3' 


55 


363 


BMY7 


Beta-amylase 1 


FP061723 


Fw: 5-M13FW-TGCATOCCTCGTOTGATG-3' 
Re: 5-M 1 3Re-TGAAACCATGGGTGAAACCT-3' 


59 


318 


DHN2 


Dehydrin 2 


FP045825 


Fw: 5-M13FW-AGCAGCAGCAAGGTC-3' 

Re: 5-M1 3Re-CCTTGATCTTCTCCATCACTCC-3' 


62 


309 


ERD8 


Heat shock cognate protein 80 


FP041070 


Fw: 5 -M 1 3 F w- AG AATG AC A A ATCAGTG A AGG A-3 ' 
Re: 5 -M 1 3 Re-CGC ATC A AAGC AATAGC AA A-3 ' 


59 


359 


GLP3A 


Germin-like protein 3A 


FP024682 


Fw: 5 '-M 1 3 F w-GG AG ACCTGGGCTTG AACTA-3 ' 
Re: 5'-M13Re-TGTCAATGGCOTGGAAm-3' 


59 


344 


LEAH 


Desiccation protectant protein Leal 4 


FP049285 


Fw: 5 -Ml 3Fw-TGTGCAGATGGGGACATAGA-3' 
Re: 5 -Ml 3Re-TGCAACAAAAATTGAAGATAGGAA-3' 


57 


278 


LTP 


Prolin-rich protein 


FP031479 


Fw: 5-M13FW-GOTOTAGTGOTGCCAAAA-3' 
Re: 5'-M13Re-TCAGAmCAGCCCATCTGAG-3' 


60 


248 


PER64 


Peroxidase 64 


FP051623 


Fw: 5 -Ml 3FW-CAACCACCACAGCATTTGAC-3' 
Re: 5 -Ml 3Re-CCTAATCTOTGCCCACCAG-3' 


62 


208 


PIP IE 


Aquaporin PIP1-2 


FP040796 


Fw: 5 -Ml 3FW-CTGTGGTAAAGGGCTTCCAA-3' 
Re: 5 -M 1 3 Re-AC ACTGC A A ACCC AATAGGC-3' 


58 


313 


RD26 


Responsive to desiccation 26 


FN712698 


Fw: 5'-M13Fw-AA™TOGTGACATOAmG-3' 
Re: 5-M13Re-GGGGCTTTTCCAATATAGAAT-3' 


54 


360 



Accession GeneBank accession number, T a annealing temperature of gene specific primers 



(http://bioinf.comav.upv.es/sff_extract/) with the clipping 
option set (sff_extract -u -c input sff -o output_raw). 
Length and quality statistics of the sequencing runs were 
calculated in R 2.12.1 [43], which was also used for graph- 
ical representation. All sequences shorter than 90 base 
pairs were removed from the analysis with custom Perl 
or shell scripts because these only contained 16 bp with 
possible SNP information (90 bp less 30 bp adaptor, 6 bp 
barcode, 18 bp M13 and 20 bp genespecific primer). 
Sequences longer than 420 bp were trimmed using 
FASTA-Trimmer (fastx_trimmer -I 420 -i inputlong -o 
output_lenclippedfasta) which is part of the FASTX-tool- 
kit (http://hannonlab.cshl.edu/fastx_toolkit/index.html) 
because a drop of quality below a threshold of 20 (454 
quality score) was observed after this length. With the 
aid of the FASTA-Barcode Splitter which is also part of 
the FAS TX- toolkit names of Quercus individuals were 
assigned to the corresponding reads with the exact match 
option set {cat input_lenclippedfasta \ fastx_barcode_ 
splitter.pl -befile Tags_Oak.txt -bol -exact -prefix BC_Oa- 
k_output/ -suffix "fasta ,f ). For this, a textfile (Tags_Oak. 
txt) including the names of the individuals and the bar- 
code sequences was used. Only perfect matching barcodes 
were considered for further analysis. After barcode identi- 
fication, FASTA-Trimmer was used to remove the first six 
bases of each read (fastx_trimmer -f 7 -i inputfasta -o 



were combined and sent for sequencing which was car- 
ried out by GATC Biotech in an 8 gasket format run on 
the Genome Sequencer FLX system (454 Life Sciences) 
with Titanium chemistry. 

Bioinformatic analysis pipeline 

A semiautomatic pipeline for allele identification was 
developed using several publicly available software tools. 
The general workflow is shown in Figure 3. Sequencing 
data delivered in sff format was extracted using sffjextract 

Table 5 PCR settings 

Gene Annealing time (sec.) Cycles Primer concentration (pmol) 



ARF16 


20 


25 


4 


BMY7 


25 


26 


4 


DHN2 


25 


27 


3 


ERD8 


20 


28 


4 


GLP3A 


25 


27 


3 


LEAH 


25 


32 


4 


LTP 


25 


26 


3 


PER64 


25 


28 


4 


PIP IE 


20 


26 


4 


RD26 


25 


26 


3 
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Figure 3 Flow chart of the bioinformatic pipeline developed for the analysis of a 454 amplicon resequencing experiment using freely 
available tools and custom Perl scripts. 



output trimmed). To identify the gene primer cross_match 
(http://www.phrap.org/consed/consed.html) was applied 
with the minimum length of matched word set to eight 
and the minimum alignment score set to ten (cross_match 
inputtrimmed ../Internal_primer.fasta -minmatch 8 -min- 
score 10 > output. crossmatch). Starting position of the 
cross_match hit was extracted from the output (egrep " A 
[0-9]" inputxrossmatch \ awk '$6<40 { printf("%s\t%s\t- 
l\t%s\n", $5, $6, $9) Y > output. crossmatckfilter) and 
used in a custom Perl script to clip Ml 3 primers which 
precede the gene primer sequences. 

To cope with errors resulting from the misinterpret- 
ation of homopolymer stretches by 454 sequencing we 
made use of the assembly tool SCARF [20]. Sanger 
sequences [21] of the amplicons were used as references 
and homopolymer correction was turned on for a mini- 
mum length of 2 bp. Therefore homopolymer errors 
longer than the reference sequence will be trimmed to 
reduce read errors. Minimum percent identity and mini- 
mum hit score were set to 80 and 100, respectively 
(./scarf -f input.fasta -r referrence.fasta -c T -/ 2 -p 80 -s 
100). Alignment of the assembled reads was done with 
MultAlin [44] using the AltDNA symbol comparison 
table and the gap penalty at extremities parameter set to 
end {ma -c: altdna.tab -x:l input. clusters). Reads with 
gaps larger than 10 bp were masked in the alignment 
using a bash script for detection and not considered for 
further analysis. 



SNP and allele detection 

We created an extensive script (FindAlleles) using BioPerl 
[45] which is able to identify mutations and detect alleles. 
Msf alignment files produced by MultAlin serve as input 
and are read using the Bior.AlignIO module. In a next step 
three different consensus sequences are calculated using 
the Bior.SimpleAlign module: consensus_iupac() — consen- 
sus sequence using IUPAC ambiguity codes for DNA, 
consensus_string() — standard consensus sequence display- 
ing bases which occur in plurality and consensus_string 
(40)— produces a standard consensus and marks positions 
with a lower percent-identity than 40 % with "?". A col- 
umn representation of all sequence reads in the alignment 
is created. Using the three consensus sequences created, 
each column is screened for possible allelic distribution or 
arbitrary nucleotide insertions or deletions created by 454 
sequencing technology and both types are annotated with 
a tag. In a next step the script checks each column for an 
allelic distribution of a 40/60 ratio. This correction thresh- 
old value was identified by manually examining randomly 
chosen alignments. In the examined alignments reading 
errors were not present in more than ten percent of the 
reads. The 40/60 distribution might be created by two dif- 
ferent nucleotides (nucleotide variation) or by the inser- 
tion or deletion of a nucleotide (InDel). If this ratio is 
found the position is considered as valid "allelic" distribu- 
tion. If it is not found, it is considered as reading error 
and is automatically corrected with the nucleotide found 
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in the Sanger reference if a previously set correction 
threshold of ten percent is reached. As SCARF only treats 
homopolymer errors longer than the length present in the 
reference, we implemented an additional routine to fill up 
homopolymers shorter than the reference. As we observed 
errors occurring already in homopolymers consisting of 
two nucleotides, the min homopolymer parameter which 
sets the amount of nucleotides in a row to trigger correc- 
tion was set to one. Nucleotide positions exceeding the 
reference are ignored. 

After the correction phase the allele identification pro- 
cedure is started running again over each column. If a 
tag for allelic distribution is found, the process splits up 
the available reads into two subgroups and a recursive 
subroutine is started for each split alignment. Each sub- 
routine receives the associated reads and continues with 
them. Reads shorter than 70 % of the reference were dis- 
carded. Then the tags are checked and split up again if 
an allelic distribution is found. After all subroutines are 
finished alleles are exported to a multiple fasta file as 
well as statistics about the number of reads accounting 
for their creation and a text file with the excluded reads 
are generated. A cluster for allele generation was consid- 
ered as valid if at least five reads covered more than 70 
% of the reference sequence derived from Sanger sequen- 
cing. Number of alleles, observed and expected hetero- 
zygosities, deviation from Hardy- Weinberg equilibrium 
and linkage disequilibrium were calculated with Gene- 
pop 4.0 [46]. Frequency of null alleles was estimated with 
FreeNA [47] and effective number of alleles was calcu- 
lated with GenAlEx 6 [48] using default settings. 

Additional files 



Additional file 1: Self-contained websites displaying alleles and 
genotypes. 

Additional file 2: Table S2. Table displaying linkage disequilibrium 
among pairwise combined loci. 
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